We obtained data on over 800 employees at a Fortune 100 company and analyzed the factors that lead to attrition.
## Load packages
# install.packages("ggplot2")
# install.packages("GGally")
# install.packages("dplyr")
# install.packages("tidyverse")
# install.packages("e1071")
# install.packages("class")
# install.packages("readr")
# install.packages("RCurl")
# install.packages("caret")
#Load Libraries
library(readr)
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble 3.1.3 ✓ stringr 1.4.0
## ✓ tidyr 1.1.4 ✓ forcats 0.5.1
## ✓ purrr 0.3.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(e1071)
library(class)
library(RCurl)
##
## Attaching package: 'RCurl'
## The following object is masked from 'package:tidyr':
##
## complete
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
#Load data
x <- getURL("https://raw.githubusercontent.com/AlonsoSalcido/CaseStudy2DDS-/main/CaseStudy.csv")
CaseStudy <- read.csv(text = x)
The first step in our exploratory data analysis is to measure and visualize current attrition in the company. We currently have a 16% attrition rate.
EDA= CaseStudy
#Irrelevant Columns
drop= c("EmployeeCount", "EmployeeNumber", "Over18", "StandardHours", "Attrition.1")
EDA= EDA[ , !(names(EDA) %in% drop)]
#Graph Attrition
ggplot(EDA, aes(x="", y=Attrition, fill=Attrition)) +
geom_bar(stat="identity", width=1) +
ggtitle("Current Attrition Percent")+
coord_polar("y")
#Attrition by Numbers
AttritionYes=nrow(EDA[EDA$Attrition == "Yes",])
AttritionNo=nrow(EDA[EDA$Attrition == "No",])
AttritionPercent= AttritionYes/nrow(EDA) #share of all employees, not the Yes/No ratio
AttritionYes
## [1] 140
AttritionNo
## [1] 730
AttritionPercent
## [1] 0.1609195
For this step, the dataset has been broken down into two groups: Yes Attrition and No Attrition.
We are looking for differences (in terms of percentage) between the two groups. If a variable has no impact on attrition, it should show a similar percentage in both groups. If a substantial difference appears between the groups, the variable may have an impact on attrition.
Travel shows no substantial difference overall. The largest gap in this category is for those who travel frequently, who make up a larger share of the attrition group than of the no-attrition group.
#Travel
ggplot(EDA,mapping=aes(x=Attrition,fill=BusinessTravel))+
geom_bar(position="fill")+
ggtitle("Attrition by Travel")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## Department
When we looked at differences between departments, we noticed that the Sales department shows a substantial difference between the two groups: it tends to have a higher attrition rate.
Research and Development has the lowest attrition.
ggplot(EDA,mapping=aes(x=Attrition,fill=Department))+
geom_bar(position="fill")+
ggtitle("Attrition by Department")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## EducationField
In terms of education field, those with a technical degree tend to have a higher attrition rate than the rest.
ggplot(EDA,mapping=aes(x=Attrition,fill=EducationField))+
geom_bar(position="fill")+
ggtitle("Attrition by EducationField")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## Gender
After visualizing gender, we can discard it as a factor in attrition.
ggplot(EDA,mapping=aes(x=Attrition,fill=Gender))+
geom_bar(position="fill")+
ggtitle("Attrition by Gender")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## JobRole
Job Role is an interesting variable since it has multiple categories. We will focus on Sales Representative, since it shows a substantial difference between the attrition and no-attrition groups.
Sales Representatives tend to be more prone to attrition than any other job role in the company.
ggplot(EDA,mapping=aes(x=Attrition,fill=JobRole))+
geom_bar(position="fill")+
ggtitle("Attrition by JobRole")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## MaritalStatus
Marital status can be an important factor in attrition. Single employees make up a substantially larger portion of the attrition group than of the no-attrition group.
Conversely, divorced employees tend to be less likely to voluntarily resign.
ggplot(EDA,mapping=aes(x=Attrition,fill=MaritalStatus))+
geom_bar(position="fill")+
ggtitle("Attrition by MaritalStatus")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
## OverTime
Overtime is one of the variables with the largest difference between the attrition and no-attrition groups.
Of the employees in the attrition group, 57% worked overtime, compared to only 24% of the no-attrition group. This is a group that we need to keep an eye on.
ggplot(EDA,mapping=aes(x=Attrition,fill=OverTime))+
geom_bar(position="fill")+
ggtitle("Attrition by OverTime")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
nrow(EDA[EDA$Attrition == "Yes" & EDA$OverTime == "Yes",])/nrow(EDA[EDA$Attrition == "Yes",])
## [1] 0.5714286
nrow(EDA[EDA$Attrition == "No" & EDA$OverTime == "Yes",])/nrow(EDA[EDA$Attrition == "No",])
## [1] 0.2356164
#Move Attrition to the front (note: c(3, 1:32) also keeps the original copy, so the column is duplicated)
NEDA = EDA[,c(3, 1:32)]
NEDA$Attrition=gsub('No', 0 , NEDA$Attrition)
NEDA$Attrition=gsub('Yes', 1 , NEDA$Attrition)
NEDA$Attrition= as.numeric(NEDA$Attrition) #Numeric
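The three-step gsub()/as.numeric() recode above works; an equivalent one-step alternative (a sketch, not the code that was run here) is a single ifelse():

```r
# One-step recode of Attrition: 1 for "Yes", 0 for "No"
NEDA$Attrition <- ifelse(EDA$Attrition == "Yes", 1, 0)
```

This avoids the intermediate character column and cannot silently produce NAs the way a partial gsub() match could.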
## Ordinal Variables Trends
We see an inverse relationship with attrition on most ordinal variables, mainly at the lower levels.
#Environment Satisfaction Visualized
NEDA %>% ggplot(aes(EnvironmentSatisfaction, Attrition)) + geom_smooth()+
ggtitle("Attrition and EnvironmentSatisfaction")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## Job Involvement
An inverse relationship between job involvement and attrition can be seen.
NEDA %>% ggplot(aes(JobInvolvement, Attrition)) + geom_smooth()+
ggtitle("Attrition and JobInvolvement")
## JobLevel
An inverse relationship with attrition is seen at the lower job levels.
NEDA %>% ggplot(aes(JobLevel, Attrition)) + geom_smooth()+
ggtitle("Attrition and JobLevel")
## JobSatisfaction
There is an inverse relationship between job satisfaction and attrition.
NEDA %>% ggplot(aes(JobSatisfaction, Attrition)) + geom_smooth()+
ggtitle("Attrition and JobSatisfaction")
## Stock Option Level
The inverse relationship between stock option level and attrition appears only at the lower levels.
NEDA %>% ggplot(aes(StockOptionLevel, Attrition)) + geom_smooth()+
ggtitle("Attrition and Stock Option Level")
## WorkLifeBalance
Work-life balance follows the same trend as the other ordinal variables: the change is mostly present at the lower levels.
NEDA %>% ggplot(aes(WorkLifeBalance, Attrition)) + geom_smooth()+
ggtitle("Attrition and WorkLifeBalance")
## Continuous Variables
We continue visualizing our dataset to better understand the variables present. In this case we use box plots to look for differences between the two groups.
## Age
This graph shows that the attrition group has a lower mean and median age than the no-attrition group.
ggplot(EDA, aes(x=Attrition, y=Age, fill=Attrition)) +
geom_boxplot(notch=TRUE) +
stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
ggtitle("Attrition by Age")
## DistanceFromHome
Distance from home shows that those who quit tend to live farther from their workplace.
ggplot(EDA, aes(x=Attrition, y=DistanceFromHome, fill=Attrition)) +
geom_boxplot(notch=TRUE) +
stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
ggtitle("Attrition by DistanceFromHome")
## MonthlyIncome
Monthly income shows that the attrition group tends to have a lower monthly income than the no-attrition group.
ggplot(EDA, aes(x=Attrition, y=MonthlyIncome, fill=Attrition)) +
geom_boxplot(notch=TRUE) +
stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
ggtitle("Attrition by MonthlyIncome")
## YearsInCurrentRole
Employees in the attrition group tend to have spent less time in their current role than those who stay.
ggplot(EDA, aes(x=Attrition, y=YearsInCurrentRole, fill=Attrition)) +
geom_boxplot(notch=TRUE) +
stat_summary(fun=mean, geom="point", shape=18, size=10, color="black", fill="red")+
ggtitle("Attrition by YearsInCurrentRole")
## Predict Attrition
After visualizing our attrition factors and gaining a better understanding of our dataset, we will attempt to predict attrition using a couple of different predictive models.
In order to properly use our different kinds of variables to predict attrition, we first need to clean up and transform our data.
KNNEDA = EDA
#Drop Non Relevant Variables
drop= c("ID", "DailyRate", "Education", "Gender", "HourlyRate")
KNNEDA= KNNEDA[ , !(names(KNNEDA) %in% drop)]
KNNEDA$Attrition=factor(KNNEDA$Attrition)
#TransformCategorical Variables
KNNEDA$Non_Travel= ifelse(KNNEDA$BusinessTravel == "Non-Travel",1,0)
KNNEDA$Travel_Frequently= ifelse(KNNEDA$BusinessTravel == "Travel_Frequently",1,0)
KNNEDA$Travel_Rarely= ifelse(KNNEDA$BusinessTravel == "Travel_Rarely",1,0)
KNNEDA$HRDepartment= ifelse(KNNEDA$Department == "Human Resources",1,0)
KNNEDA$RDDepartment= ifelse(KNNEDA$Department == "Research & Development",1,0)
KNNEDA$SalesDepartment= ifelse(KNNEDA$Department == "Sales",1,0)
KNNEDA$HumanResourcesDegree= ifelse(KNNEDA$EducationField == "Human Resources",1,0)
KNNEDA$LifeSciencesDegree= ifelse(KNNEDA$EducationField == "Life Sciences",1,0)
KNNEDA$MarketingDegree= ifelse(KNNEDA$EducationField == "Marketing",1,0)
KNNEDA$MedicalDegree= ifelse(KNNEDA$EducationField == "Medical",1,0)
KNNEDA$OtherDegree= ifelse(KNNEDA$EducationField == "Other",1,0)
KNNEDA$TechnicalDegree= ifelse(KNNEDA$EducationField == "Technical Degree",1,0)
KNNEDA$HealthcareRep= ifelse(KNNEDA$JobRole == "Healthcare Representative",1,0)
KNNEDA$HumanResources= ifelse(KNNEDA$JobRole == "Human Resources",1,0)
KNNEDA$LaboratoryTechnician= ifelse(KNNEDA$JobRole == "Laboratory Technician",1,0)
KNNEDA$Manager= ifelse(KNNEDA$JobRole == "Manager",1,0)
KNNEDA$ManufacturingDirector= ifelse(KNNEDA$JobRole == "Manufacturing Director",1,0)
KNNEDA$ResearchDirector= ifelse(KNNEDA$JobRole == "Research Director",1,0)
KNNEDA$ResearchScientist= ifelse(KNNEDA$JobRole == "Research Scientist",1,0)
KNNEDA$SalesExec= ifelse(KNNEDA$JobRole == "Sales Executive",1,0)
KNNEDA$SalesRep= ifelse(KNNEDA$JobRole == "Sales Representative",1,0)
KNNEDA$Divorced= ifelse(KNNEDA$MaritalStatus == "Divorced",1,0)
KNNEDA$Married= ifelse(KNNEDA$MaritalStatus == "Married",1,0)
KNNEDA$Single= ifelse(KNNEDA$MaritalStatus == "Single",1,0)
KNNEDA$OverTime= ifelse(KNNEDA$OverTime == "Yes",1,0)
#Drop Transformed Variables
drop= c("BusinessTravel", "Department", "EducationField", "Gender", "JobRole", "MaritalStatus")
KNNEDA= KNNEDA[ , !(names(KNNEDA) %in% drop)]
KNNEDA= KNNEDA[,c(2, 1:46)]
KNNEDA= KNNEDA[ , !(names(KNNEDA)=="Attrition.1")]
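As an aside, the hand-written ifelse() indicator columns above could also be generated in one step with caret's dummyVars(). This is a hedged alternative sketch, not the code used in this analysis:

```r
# Alternative one-hot encoding via caret::dummyVars (sketch; assumes caret is loaded)
cats <- c("BusinessTravel", "Department", "EducationField", "JobRole", "MaritalStatus")
catdf <- data.frame(lapply(EDA[, cats], factor))  # dummyVars expects factors
dv <- dummyVars(~ ., data = catdf)
dummies <- as.data.frame(predict(dv, newdata = catdf))
```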
Once our data is ready, we can determine the top three factors for attrition. Running a generalized linear model and inspecting each factor's p-value, we find the top three factors are: OverTime, JobInvolvement, and JobSatisfaction.
model=glm(Attrition~.,data = KNNEDA, family= "binomial")
summary(model)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = KNNEDA)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7863 -0.4455 -0.2171 -0.0632 3.2645
##
## Coefficients: (5 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.310e+00 1.782e+00 4.103 4.07e-05 ***
## Age -3.033e-02 1.842e-02 -1.647 0.099654 .
## DistanceFromHome 5.075e-02 1.494e-02 3.398 0.000680 ***
## EnvironmentSatisfaction -3.478e-01 1.152e-01 -3.018 0.002541 **
## JobInvolvement -7.769e-01 1.701e-01 -4.568 4.92e-06 ***
## JobLevel -8.053e-02 4.469e-01 -0.180 0.857015
## JobSatisfaction -5.123e-01 1.129e-01 -4.538 5.67e-06 ***
## MonthlyIncome 9.571e-05 1.191e-04 0.804 0.421517
## MonthlyRate -1.518e-05 1.771e-05 -0.857 0.391324
## NumCompaniesWorked 2.377e-01 5.444e-02 4.366 1.27e-05 ***
## OverTime 1.971e+00 2.562e-01 7.691 1.46e-14 ***
## PercentSalaryHike -1.829e-02 5.296e-02 -0.345 0.729837
## PerformanceRating 3.508e-01 5.535e-01 0.634 0.526240
## RelationshipSatisfaction -2.468e-01 1.087e-01 -2.271 0.023139 *
## StockOptionLevel -7.907e-02 2.078e-01 -0.381 0.703571
## TotalWorkingYears -1.017e-01 4.168e-02 -2.439 0.014723 *
## TrainingTimesLastYear -2.821e-01 1.037e-01 -2.721 0.006507 **
## WorkLifeBalance -5.645e-01 1.687e-01 -3.347 0.000817 ***
## YearsAtCompany 8.341e-02 5.268e-02 1.583 0.113355
## YearsInCurrentRole -1.318e-01 6.027e-02 -2.187 0.028777 *
## YearsSinceLastPromotion 2.484e-01 5.938e-02 4.184 2.87e-05 ***
## YearsWithCurrManager -1.607e-01 6.314e-02 -2.544 0.010950 *
## Non_Travel -8.333e-01 4.518e-01 -1.844 0.065114 .
## Travel_Frequently 7.097e-01 3.170e-01 2.239 0.025169 *
## Travel_Rarely NA NA NA NA
## HRDepartment -1.395e+01 7.601e+02 -0.018 0.985359
## RDDepartment 1.414e-01 1.257e+00 0.112 0.910435
## SalesDepartment NA NA NA NA
## HumanResourcesDegree 9.643e-01 1.252e+00 0.770 0.441243
## LifeSciencesDegree -6.154e-01 4.295e-01 -1.433 0.151927
## MarketingDegree -5.060e-01 5.508e-01 -0.919 0.358258
## MedicalDegree -6.369e-01 4.406e-01 -1.445 0.148352
## OtherDegree -3.328e-01 6.131e-01 -0.543 0.587200
## TechnicalDegree NA NA NA NA
## HealthcareRep -2.170e+00 1.438e+00 -1.509 0.131232
## HumanResources 1.275e+01 7.601e+02 0.017 0.986619
## LaboratoryTechnician -1.163e+00 1.330e+00 -0.874 0.381960
## Manager -2.164e+00 1.390e+00 -1.557 0.119469
## ManufacturingDirector -3.828e+00 1.574e+00 -2.432 0.014999 *
## ResearchDirector -4.359e+00 1.965e+00 -2.219 0.026488 *
## ResearchScientist -1.832e+00 1.334e+00 -1.373 0.169825
## SalesExec -1.469e+00 5.521e-01 -2.660 0.007812 **
## SalesRep NA NA NA NA
## Divorced -1.776e+00 5.109e-01 -3.475 0.000511 ***
## Married -8.321e-01 3.295e-01 -2.526 0.011548 *
## Single NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 767.67 on 869 degrees of freedom
## Residual deviance: 472.49 on 829 degrees of freedom
## AIC: 554.49
##
## Number of Fisher Scoring iterations: 15
The first model used to predict attrition is KNN. KNN gave us 80% accuracy, with high sensitivity but low specificity.
KNNEDA1= KNNEDA[, c(1,3,4,6, 11, 15,18, 23, 28, 34, 39, 43,46)]
set.seed(1)
confusionMatrix(table(knn.cv(KNNEDA1[,2:12],KNNEDA1$Attrition, k = 1), KNNEDA1$Attrition))
## Confusion Matrix and Statistics
##
##
## No Yes
## No 662 101
## Yes 68 39
##
## Accuracy : 0.8057
## 95% CI : (0.7779, 0.8315)
## No Information Rate : 0.8391
## P-Value [Acc > NIR] : 0.99611
##
## Kappa : 0.2049
##
## Mcnemar's Test P-Value : 0.01383
##
## Sensitivity : 0.9068
## Specificity : 0.2786
## Pos Pred Value : 0.8676
## Neg Pred Value : 0.3645
## Prevalence : 0.8391
## Detection Rate : 0.7609
## Detection Prevalence : 0.8770
## Balanced Accuracy : 0.5927
##
## 'Positive' Class : No
##
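The model above uses k = 1. A quick sketch (assuming KNNEDA1 as constructed above) for scanning other values of k with the same leave-one-out cross-validation:

```r
# Scan k = 1..15 with knn.cv and record leave-one-out accuracy for each
set.seed(1)
accs <- sapply(1:15, function(k) {
  pred <- knn.cv(KNNEDA1[, 2:12], KNNEDA1$Attrition, k = k)
  mean(pred == KNNEDA1$Attrition)
})
data.frame(k = 1:15, accuracy = accs)
```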
Our last model, Naive Bayes, gave us the best results: 86% accuracy, 91% sensitivity, and 58% specificity.
set.seed(1)
trainIndices = sample(1:dim(KNNEDA)[1],round(0.7 * dim(KNNEDA)[1]))
train = KNNEDA[trainIndices,]
test = KNNEDA[-trainIndices,]
model_train = naiveBayes(Attrition~.,data = train)
predict(model_train, train[,c(4,5,6,8,10,11,15,18,19,24,43,46)])
## Warning in predict.naiveBayes(model_train, train[, c(4, 5, 6, 8, 10, 11, : Type
## mismatch between training and new data for variable 'Age'. Did you use factors
## with numeric labels for training, and numeric values for new data?
## (The same type-mismatch warning is repeated for each remaining numeric predictor; repeats omitted.)
## [1] No No Yes Yes No Yes No No No No No No No No No No No No
## ... (remaining predicted labels omitted) ...
## Levels: No Yes
#Fit a Naive Bayes model on the test split and classify the same observations
model_test = naiveBayes(Attrition~., data = test)
classifications = predict(model_test, test[, c(4,5,6,8,10,11,15,18,19,24,43,46)])
## Warning in predict.naiveBayes(model_test, test[, c(4, 5, 6, 8, 10, 11, 15, :
## Type mismatch between training and new data for variable 'Age'. Did you use
## factors with numeric labels for training, and numeric values for new data?
## (The same type-mismatch warning repeats for every remaining predictor, from
## 'DistanceFromHome' through 'Married'.)
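The warnings above suggest the column types seen at fit time and at predict time disagree: naiveBayes() appears to have stored factor tables for these variables, while the data frame passed to predict() holds plain numerics. A minimal sketch of one way to avoid this (the column names here are examples taken from the warnings, and this assumes the indicator columns are 0/1 numerics) is to coerce the types once, before the train/test split, so both splits carry identical types:

```r
#Sketch only: coerce 0/1 indicator columns to factors before splitting, so
#naiveBayes() sees the same column types when fitting and when predicting.
indicator_cols <- c("TechnicalDegree", "HealthcareRep", "HumanResources",
                    "Divorced", "Married")  #extend to all dummy columns
train[indicator_cols] <- lapply(train[indicator_cols], factor)
test[indicator_cols]  <- lapply(test[indicator_cols], factor)
```

With consistent types the predictions themselves are unchanged here, but the warning spam disappears and genuine mismatches become easier to spot.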
CM_NaiveBayes=confusionMatrix(table(classifications,test$Attrition))
CM_NaiveBayes
## Confusion Matrix and Statistics
##
##
## classifications No Yes
## No 201 18
## Yes 19 23
##
## Accuracy : 0.8582
## 95% CI : (0.8099, 0.8982)
## No Information Rate : 0.8429
## P-Value [Acc > NIR] : 0.2801
##
## Kappa : 0.4699
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9136
## Specificity : 0.5610
## Pos Pred Value : 0.9178
## Neg Pred Value : 0.5476
## Prevalence : 0.8429
## Detection Rate : 0.7701
## Detection Prevalence : 0.8391
## Balanced Accuracy : 0.7373
##
## 'Positive' Class : No
##
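As a sanity check, the headline statistics can be recomputed by hand from the printed confusion matrix (rows are predicted class, columns are the reference, and the positive class is "No"):

```r
#Counts copied from the confusion matrix above
tab <- matrix(c(201, 19, 18, 23), nrow = 2,
              dimnames = list(pred = c("No", "Yes"), ref = c("No", "Yes")))
sensitivity <- tab["No", "No"] / sum(tab[, "No"])    #201 / 220 = 0.9136
specificity <- tab["Yes", "Yes"] / sum(tab[, "Yes"]) #23 / 41 = 0.5610
accuracy    <- sum(diag(tab)) / sum(tab)             #224 / 261 = 0.8582
```

These agree with the caret output, confirming the low specificity: with "No" as the positive class, the model recovers only 23 of the 41 employees who actually left.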
In our analysis we found it interesting that, for the ordinal variables, the lowest level shows the largest difference between employees who stay and those who leave.
In general, the Attrition group contains a noticeably higher share of employees at the bottom level of each variable; we do not see the same difference at the higher levels.
Our recommendation is to focus on employees who have no stock options or who rank at the lowest level of Environment Satisfaction, Job Involvement, Job Level, Job Satisfaction, and Work Life Balance.
If we can move those employees just one level higher, we can have a real impact on attrition.
Below are some graphs visualizing those differences.
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(EnvironmentSatisfaction)))+
geom_bar(position="fill")+
ggtitle("Attrition by EnvironmentSatisfaction")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(JobInvolvement)))+
geom_bar(position="fill")+
ggtitle("Attrition by JobInvolvement")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(JobLevel)))+
geom_bar(position="fill")+
ggtitle("Attrition by JobLevel")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(JobSatisfaction)))+
geom_bar(position="fill")+
ggtitle("Attrition by JobSatisfaction")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(StockOptionLevel)))+
geom_bar(position="fill")+
ggtitle("Attrition by StockOptionLevel")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
ggplot(EDA,mapping=aes(x=Attrition,fill=as.character(WorkLifeBalance)))+
geom_bar(position="fill")+
ggtitle("Attrition by WorkLifeBalance")+
ylab("Percentage")+
scale_y_continuous(labels = scales::percent_format())
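The six plots above differ only in the fill variable, so they can also be produced in a single loop. This is a sketch only, assuming the EDA data frame built earlier and the same variable names; `.data[[v]]` is the tidy-eval pronoun for selecting a column by string inside aes():

```r
#Sketch: generate all six proportion charts from one template
ordinal_vars <- c("EnvironmentSatisfaction", "JobInvolvement", "JobLevel",
                  "JobSatisfaction", "StockOptionLevel", "WorkLifeBalance")
for (v in ordinal_vars) {
  p <- ggplot(EDA, mapping = aes(x = Attrition, fill = as.character(.data[[v]]))) +
    geom_bar(position = "fill") +
    ggtitle(paste("Attrition by", v)) +
    ylab("Percentage") +
    labs(fill = v) +
    scale_y_continuous(labels = scales::percent_format())
  print(p)  #print() is required to render ggplot objects inside a loop
}
```

The per-variable blocks above remain the authoritative versions; the loop is just a more compact equivalent.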